1. Syntactic Patterns for Technical Terms


In [1]:
import nltk
from nltk.corpus import brown

As seen in the Chuang et al. paper and in the Manning and Schuetze chapter, there is a well-known part-of-speech-based pattern, defined by Justeson and Katz, for identifying simple noun phrases that often works well for pulling out keyphrases.

Chuang et al. use this pattern: Technical Term T = (A | N)+ (N | C) | N

Below, please write a function to define a chunker using the RegexpParser as illustrated in the section Chunking with Regular Expressions. You'll need to revise the grammar rules shown there to match the pattern shown above. You can be liberal with your definition of what is meant by N here. Also, C refers to cardinal number, which is CD in the Brown corpus.


In [2]:
grammar = r"""
    T: {<JJ.*|N.*>+ <N.*|CC|CS>|<N.*>}
"""

t_term = nltk.RegexpParser(grammar)
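
The pattern T = (A | N)+ (N | C) | N can also be checked directly on a sequence of tags, without a chunker. The following is a minimal sketch, not the assignment's required solution: it assumes Brown-style tags, maps JJ* to A, CD* to C (as the assignment text states) and any tag starting with N to N, and uses hypothetical helper names (to_code, technical_terms) and a made-up example sentence.

```python
import re

def to_code(tag):
    """Map a Brown-style tag to A (adjective), N (noun), C (cardinal), or x (other)."""
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("CD"):
        return "C"
    if tag.startswith("N"):
        return "N"
    return "x"

# T = (A | N)+ (N | C) | N, written over one-letter codes.
TERM = re.compile(r"[AN]+[NC]|N")

def technical_terms(tagged_sent):
    """Return the word sequences whose tags match the technical-term pattern."""
    codes = "".join(to_code(tag) for _, tag in tagged_sent)
    return [[word for word, _ in tagged_sent[m.start():m.end()]]
            for m in TERM.finditer(codes)]

# Hypothetical tagged sentence, used only for illustration.
sent = [("the", "AT"), ("current", "JJ"), ("fiscal", "JJ"), ("year", "NN"),
        ("ends", "VBZ"), ("Aug.", "NP"), ("31", "CD")]
print(technical_terms(sent))
# [['current', 'fiscal', 'year'], ['Aug.', '31']]
```

Encoding each tag as a single letter keeps match offsets aligned with token positions, so the matched span can be sliced straight out of the sentence.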

In [3]:
nltk.help.brown_tagset("CD.*")


CD: numeral, cardinal
    two one 1 four 2 1913 71 74 637 1937 8 five three million 87-31 29-5
    seven 1,119 fifty-three 7.5 billion hundred 125,000 1,700 60 100 six
    ...
CD$: numeral, cardinal, genitive
    1960's 1961's .404's

Below, please write a function to call the chunker, run it on some sentences, and then print out the results for those sentences.

For uniformity, please run it on sentences 100 through 104 from the tagged Brown corpus news category.

Then extract the phrases themselves using the subtree extraction technique shown in the Exploring Text Corpora section. (Note: Section 7.4 shows how to get to the actual words in the phrase by using the tree.leaves() command.)


In [4]:
# Restrict to the news category, as the assignment asks.
sents = brown.tagged_sents(categories='news')[100:105]

In [5]:
parsed_sents = [t_term.parse(s) for s in sents]

In [6]:
tech_phrases = [[t for t in s.subtrees() if t.node=="T"] for s in parsed_sents]

In [7]:
tech_phrases


Out[7]:
[[Tree('T', [('Daniel', 'NP')]),
  Tree('T', [('fight', 'NN')]),
  Tree('T', [('measure', 'NN')]),
  Tree('T', [('rejection', 'NN')]),
  Tree('T', [('previous', 'JJ'), ('Legislatures', 'NNS-TL')]),
  Tree('T', [('public', 'JJ'), ('hearing', 'NN')]),
  Tree('T', [('House', 'NN-TL'), ('Committee', 'NN-TL')]),
  Tree('T', [('Revenue', 'NN-TL')]),
  Tree('T', [('Taxation', 'NN-TL')])],
 [Tree('T', [('committee', 'NN'), ('rules', 'NNS')]),
  Tree('T', [('subcommittee', 'NN')]),
  Tree('T', [('week', 'NN')])],
 [Tree('T', [('questions', 'NNS')]),
  Tree('T', [('committee', 'NN'), ('members', 'NNS')]),
  Tree('T', [('bankers', 'NNS')]),
  Tree('T', [('witnesses', 'NNS')]),
  Tree('T', [('doubt', 'NN'), ('that', 'CS')]),
  Tree('T', [('passage', 'NN')])],
 [Tree('T', [('Daniel', 'NP')]),
  Tree('T', [('estimate', 'NN'), ('that', 'CS')]),
  Tree('T', [('dollars', 'NNS')]),
  Tree('T', [('deficit', 'NN')]),
  Tree('T', [('dollars', 'NNS')]),
  Tree('T', [('end', 'NN')]),
  Tree('T', [('current', 'JJ'), ('fiscal', 'JJ'), ('year', 'NN')]),
  Tree('T', [('Aug.', 'NP')])],
 [Tree('T', [('committee', 'NN')]),
  Tree('T', [('measure', 'NN')]),
  Tree('T', [('means', 'NNS')]),
  Tree('T', [('escheat', 'NN'), ('law', 'NN')]),
  Tree('T', [('books', 'NNS')]),
  Tree('T', [('Texas', 'NP')]),
  Tree('T', [('republic', 'NN')])]]
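
To get from these trees to the phrases themselves, the (word, tag) leaves of each chunk can be joined into a string. The sketch below assumes each chunk is available as a plain list of pairs, such as tree.leaves() returns; phrase_text is a hypothetical helper name, and the three chunks are copied from the output above.

```python
def phrase_text(leaves):
    """Join the words of a chunk's (word, tag) leaves into one string."""
    return " ".join(word for word, _tag in leaves)

# Leaves copied from three of the chunks printed above.
chunks = [
    [("previous", "JJ"), ("Legislatures", "NNS-TL")],
    [("House", "NN-TL"), ("Committee", "NN-TL")],
    [("current", "JJ"), ("fiscal", "JJ"), ("year", "NN")],
]
print([phrase_text(c) for c in chunks])
# ['previous Legislatures', 'House Committee', 'current fiscal year']
```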

2. Identify Proper Nouns

For this next task, write a new version of the chunker, but this time change it in two ways:

  1. Make it recognize proper nouns
  2. Make it work on your personal text collection, which means that you need to run a tagger over your personal text collection.

Note that the second requirement means that you need to run a tagger over your personal text collection before you design the proper noun recognizer. You can use a pre-trained tagger or train your own on one of the existing tagged collections (brown, conll, or treebank).

Tagger: Your code for optionally training a tagger, and for running the tagger on your personal collection, goes here:


In [8]:
import corpii
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

In [9]:
debates= nltk.clean_html(corpii.load_pres_debates().raw())
sents = sent_tokenizer.sentences_from_text(debates)

In [10]:
def build_backoff_tagger (train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2

tagger = build_backoff_tagger(brown.tagged_sents())
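
The backoff chain above falls through from the bigram tagger to the unigram tagger and finally to the default tag NN. The toy sketch below illustrates only the last two steps in plain Python, to show what happens to a word that never occurs in the training data. The names train_unigram and tag_with_backoff and the tiny training set are assumptions made for illustration; this is not how NLTK implements its taggers.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_with_backoff(tokens, unigram_model, default_tag="NN"):
    """Tag known words from the model; fall back to default_tag otherwise."""
    return [(tok, unigram_model.get(tok, default_tag)) for tok in tokens]

# Toy training data (hypothetical; the notebook trains on the Brown corpus).
train = [[("Washington", "NP"), ("is", "BEZ"), ("cold", "JJ")],
         [("Bush", "NP"), ("is", "BEZ"), ("here", "RB")]]
model = train_unigram(train)
print(tag_with_backoff(["Perot", "is", "here"], model))
# [('Perot', 'NN'), ('is', 'BEZ'), ('here', 'RB')]
```

A name that is absent from the training sentences ends up with the default tag NN rather than NP, so a chunker that looks only for NP tags would not pick it up.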

In [11]:
token_regex = r"""(?x)
    # taken from the NLTK book example
    ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.            # ellipsis
    | [][.,;"'?():-_`]  # these are separate tokens
"""
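
To see how this pattern splits text, it can be tried on a couple of short strings. The sketch below is a stand-in for nltk.regexp_tokenize that uses only the standard re module: it assumes the same alternatives, rewritten with non-capturing groups so that re.findall returns whole tokens. TOKEN_RE and tokenize are hypothetical names.

```python
import re

# Same alternatives as token_regex, with non-capturing groups so that
# re.findall returns whole tokens rather than group contents.
TOKEN_RE = re.compile(r"""(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*            # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                  # ellipsis
    | [][.,;"'?():-_`]        # these are separate tokens
""")

def tokenize(text):
    """Return the tokens that the pattern finds in text, left to right."""
    return TOKEN_RE.findall(text)

print(tokenize("Thank you, Mr. Perot."))
# ['Thank', 'you', ',', 'Mr', '.', 'Perot', '.']
print(tokenize("make $30,000 a month"))
# ['make', '$30', ',', '000', 'a', 'month']
```

Note that "Mr." is split into 'Mr' and '.', because the abbreviation alternative only covers single capital letters followed by a period, and that "$30,000" is split at the comma.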

In [12]:
t_sents = [nltk.regexp_tokenize(s, token_regex) for s in sents]

In [13]:
tagged_sents = [tagger.tag(s) for s in t_sents]
#tags = tagger.tag(tokens)

Chunker: Code for the proper noun chunker goes here:


In [15]:
re_noun_chunk = r"""
    NP: {<NP>+|<NP><IN.*|DT.*><NP>}
"""

np_parser = nltk.RegexpParser(re_noun_chunk)

Test the Chunker: Test your proper noun recognizer on a lot of sentences to see how well it is working. You might want to add prepositions in order to improve your results.


In [16]:
for i in range(0,50):
    print "**********************************************"
    print sents[i]
    print "Proper Nouns:"
    print [t for t in np_parser.parse(tagged_sents[i]).subtrees() if t.node=="NP"]


**********************************************
October 15, 1992 First Half Debate Transcript 
 
 October 15, 1992 
 The Second Clinton-Bush-Perot Presidential Debate (First Half of Debate) 
 This is the first half of the transcript of the Richmond debate.
Proper Nouns:
[Tree('NP', [('October', 'NP')]), Tree('NP', [('October', 'NP')])]
**********************************************
The October 15th "town hall" format debate was moderated by Carole Simpson.
Proper Nouns:
[Tree('NP', [('October', 'NP')]), Tree('NP', [('Simpson', 'NP')])]
**********************************************
She explains the format in her opening remarks.
Proper Nouns:
[]
**********************************************
The length of this printed transcript is approximately 20 pages.
Proper Nouns:
[]
**********************************************
CAROLE SIMPSON: Good evening and welcome to this second of three presidential debates between the major candidates for president of the US.
Proper Nouns:
[]
**********************************************
The candidates are the Republican nominee, President George Bush, the independent Ross Perot and Governor Bill Clinton, the Democratic nominee.
Proper Nouns:
[Tree('NP', [('Republican', 'NP')]), Tree('NP', [('George', 'NP'), ('Bush', 'NP')]), Tree('NP', [('Ross', 'NP')]), Tree('NP', [('Bill', 'NP'), ('Clinton', 'NP')])]
**********************************************
My name is Carole Simpson, and I will be the moderator for tonight's 90-minute debate, which is coming to you from the campus of the University of Richmond in Richmond, Virginia.
Proper Nouns:
[Tree('NP', [('Simpson', 'NP')]), Tree('NP', [('Richmond', 'NP')]), Tree('NP', [('Richmond', 'NP')]), Tree('NP', [('Virginia', 'NP')])]
**********************************************
Now, tonight's program is unlike any other presidential debate in history.
Proper Nouns:
[]
**********************************************
We're making history now and it's pretty exciting.
Proper Nouns:
[]
**********************************************
An independent polling firm has selected an audience of 209 uncommitted voters from this area.
Proper Nouns:
[]
**********************************************
The candidates will be asked questions by these voters on a topic of their choosing -- anything they want to ask about.
Proper Nouns:
[]
**********************************************
My job as moderator is to, you know, take care of the questioning, ask questions myself if I think there needs to be continuity and balance, and sometimes I might ask the candidates to respond to what another candidate may have said.
Proper Nouns:
[]
**********************************************
Now, the format has been agreed to by representatives of both the Republican and Democratic campaigns, and there is no subject matter that is restricted.
Proper Nouns:
[Tree('NP', [('Republican', 'NP')])]
**********************************************
Anything goes.
Proper Nouns:
[]
**********************************************
We can ask anything.
Proper Nouns:
[]
**********************************************
After the debate, the candidates will have an opportunity to make a closing statement.
Proper Nouns:
[]
**********************************************
So, President Bush, I think you said it earlier -- let's get it on.
Proper Nouns:
[Tree('NP', [('Bush', 'NP')])]
**********************************************
PRESIDENT GEORGE BUSH: Let's go.
Proper Nouns:
[]
**********************************************
SIMPSON: And I think the first question is over here.
Proper Nouns:
[]
**********************************************
AUDIENCE QUESTION: Yes.
Proper Nouns:
[]
**********************************************
I'd like to direct my question to Mr. Perot.
Proper Nouns:
[]
**********************************************
What will you do as president to open foreign markets to fair competition from American business and to stop unfair competition here at home from foreign countries so that we can bring jobs back to the US?
Proper Nouns:
[]
**********************************************
ROSS PEROT: That's right at the top of my agenda.
Proper Nouns:
[]
**********************************************
We've shipped millions of jobs overseas and we have a strange situation because we have a process in Washington where after you've served for a while you cash in, become a foreign lobbyist, make $30,000 a month, then take a leave, work on presidential campaigns, make sure you've got good contacts and then go back out.
Proper Nouns:
[Tree('NP', [('Washington', 'NP')])]
**********************************************
Now, if you just want to get down to brass tacks, first thing you ought to do is get all these folks who've got these 1-way trade agreements that we've negotiated over the years and say fellas, we'll take the same deal we gave you.
Proper Nouns:
[]
**********************************************
And they'll gridlock right at that point because for example, we've got international competitors who simply could not unload their cars off the ships if they had to comply -- you see, if it was a 2-way street, just couldn't do it.
Proper Nouns:
[]
**********************************************
We have got to stop sending jobs overseas.
Proper Nouns:
[]
**********************************************
To those of you in the audience who are business people: pretty simple.
Proper Nouns:
[]
**********************************************
If you're paying $12, $13, $14 an hour for a factory worker, and you can move your factory south of the border, pay $1 an hour for labor, hire a young -- let's assume you've been in business for a long time.
Proper Nouns:
[]
**********************************************
You've got a mature workforce.
Proper Nouns:
[]
**********************************************
Pay $1 an hour for your labor, have no health care -- that's the most expensive single element in making the car.
Proper Nouns:
[]
**********************************************
Have no environmental controls, no pollution controls and no retirement.
Proper Nouns:
[]
**********************************************
And you don't care about anything but making money.
Proper Nouns:
[]
**********************************************
There will be a job-sucking sound going south.
Proper Nouns:
[]
**********************************************
If the people send me to Washington the first thing I'll do is study that 2000-page agreement and make sure it's a 2-way street.
Proper Nouns:
[Tree('NP', [('Washington', 'NP')])]
**********************************************
One last point here.
Proper Nouns:
[]
**********************************************
I decided I was dumb and didn't understand it so I called a "Who's Who" of the folks that have been around it, and I said why won't everybody go south; they said it will be disruptive; I said for how long.
Proper Nouns:
[]
**********************************************
I finally got 'em for 12 to 15 years.
Proper Nouns:
[]
**********************************************
And I said, well, how does it stop being disruptive?
Proper Nouns:
[]
**********************************************
And that is when their jobs come up from a dollar an hour to $6 an hour, and ours go down to $6 an hour; then it's leveled again, but in the meantime you've wrecked the country with these kind of deals.
Proper Nouns:
[]
**********************************************
We got to cut it out.
Proper Nouns:
[]
**********************************************
SIMPSON: Thank you, Mr. Perot.
Proper Nouns:
[]
**********************************************
I see that the president has stood up, so he must have something to say about this.
Proper Nouns:
[]
**********************************************
BUSH: Carole, the thing that saved us in this global economic slowdown has been our exports, and what I'm trying to do is increase our exports.
Proper Nouns:
[]
**********************************************
And if indeed all the jobs were going to move south because there are lower wages, there are lower wages now and they haven't done that.
Proper Nouns:
[]
**********************************************
And so I have just negotiated with the president of Mexico the North American Free Trade Agreement -- and the prime minister of Canada, I might add -- and I want to have more of these free trade agreements, because export jobs are increasing far faster than any jobs that may have moved overseas.
Proper Nouns:
[Tree('NP', [('Mexico', 'NP')]), Tree('NP', [('Canada', 'NP')])]
**********************************************
That's a scare tactic, because it's not that many.
Proper Nouns:
[]
**********************************************
But any one that's here, we want to have more jobs here.
Proper Nouns:
[]
**********************************************
And the way to do that is to increase our exports.
Proper Nouns:
[]
**********************************************
Some believe in protection.
Proper Nouns:
[]

Notes

I tried adding prepositions and it didn't really help. There is still room for improvement at the tagging stage. For example, names that follow titles, as in "Mr. Perot", are not being tagged as proper nouns.
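
One possible reason the preposition alternative had little effect is the order of the alternatives in the grammar. The sketch below uses plain Python regular expressions over a string of tag codes, on the assumption that chunk tag patterns behave like Python regular expressions, where the leftmost alternative that matches wins. The tag sequence and the helper name chunks are hypothetical.

```python
import re

def chunks(tag_string, pattern):
    """Return every non-overlapping match of pattern in a string of <TAG> codes."""
    return [m.group(0) for m in re.finditer(pattern, tag_string)]

# Hypothetical tag sequence for a name such as "Bank/NP of/IN America/NP".
tags = "<NP><IN><NP>"

# Same order as the grammar above: the bare proper-noun run comes first.
run_first = r"(?:<NP>)+|<NP><(?:IN|DT)><NP>"
# Reordered: the longer preposition/determiner alternative is tried first.
long_first = r"<NP><(?:IN|DT)><NP>|(?:<NP>)+"

print(chunks(tags, run_first))   # ['<NP>', '<NP>']
print(chunks(tags, long_first))  # ['<NP><IN><NP>']
```

With the bare run listed first, the first proper noun is consumed on its own before the longer alternative is ever tried, so the preposition never joins a chunk.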

FreqDist Results: After your proper noun recognizer is working to your satisfaction, run it over your entire collection, feed the results into a FreqDist, and print out the top 20 proper nouns by frequency. That code goes here, along with the output:


In [17]:
trees = [np_parser.parse(s) for s in tagged_sents]
pnouns = [i for t in trees for i in t.subtrees() if i.node=="NP"]

In [18]:
pn_freq = nltk.FreqDist([pn.pprint() for pn in pnouns])

In [19]:
pn_freq.items()[0:20]


Out[19]:
[('(NP America/NP)', 847),
 ('(NP Congress/NP)', 420),
 ('(NP Bush/NP)', 332),
 ('(NP Iraq/NP)', 309),
 ('(NP John/NP)', 261),
 ('(NP Iran/NP)', 204),
 ('(NP Kennedy/NP)', 202),
 ('(NP Washington/NP)', 199),
 ('(NP Republican/NP)', 190),
 ('(NP Carter/NP)', 189),
 ('(NP Ford/NP)', 172),
 ('(NP Jim/NP)', 139),
 ('(NP Clinton/NP)', 137),
 ('(NP Israel/NP)', 129),
 ('(NP Nixon/NP)', 126),
 ('(NP China/NP)', 123),
 ('(NP Bob/NP)', 121),
 ('(NP George/NP Bush/NP)', 111),
 ('(NP Gore/NP)', 109),
 ('(NP Bill/NP Clinton/NP)', 99)]
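
The counts above are keyed on the printed tree string, such as '(NP America/NP)'. An alternative is to key on the plain words of each chunk. The sketch below shows only the counting step, using collections.Counter as a stand-in for FreqDist; top_phrases is a hypothetical helper, and the phrase list is made up for illustration.

```python
from collections import Counter

def top_phrases(phrases, n=20):
    """Count phrase strings and return the n most frequent as (phrase, count)."""
    return Counter(phrases).most_common(n)

# Hypothetical phrase list; in the notebook these would come from the chunker.
phrases = ["America", "Congress", "America", "George Bush", "America", "Congress"]
print(top_phrases(phrases, n=2))
# [('America', 3), ('Congress', 2)]
```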

For Wednesday

Just FYI, in the assignment for Wednesday, October 8, you'll be asked to extend this code a bit more to discover interesting patterns using objects or subjects of verbs, and to do a bit of WordNet grouping. This will be posted soon. Note that these exercises are intended to provide you with functions to use directly in your larger assignment.

